Software Voting in Asynchronous NMR Computer Structures

1. Introduction

Modern computer systems are being used in environments that require increased reliability due to the nature of the tasks being performed. For example, future avionics computers will replace the mechanical control of present aircraft. The computers will make thousands of decisions per second concerning the stability of the aircraft. The system must be designed so that the computer will never fail in flight, since the stability decisions cannot be made by the pilot. In many cases, the required reliability is being obtained by replicating hardware components, and comparing the outputs of the components to determine the correct result. The replication allows the system to tolerate failures in components without affecting the system reliability.

One technique used to improve the system reliability is to replicate the hardware an odd number of times, and to compare the outputs of the modules to determine whether a majority of the modules agree. If a majority do agree, then this output is assumed to be the correct output. The comparison to determine a majority is called voting. The system that performs the comparison on the module outputs is called a voter. If the hardware is replicated three times, then the system has Triple Modular Redundancy (TMR). The generalization of TMR is N-Modular Redundancy (NMR), in which each module is replicated N times, and all N outputs are voted on to determine the correct output. Figure 1-1 shows a TMR system with four modules. The total number of modules and voters required for TMR is 3 × (number of modules + number of communication paths).

Triple Modular Redundancy is a useful technique to mask failures in a system. One module can fail completely in a TMR system and the output of the system should not be affected. The number of redundant modules can be increased if the system reliability must be increased. The system reliability can therefore be increased by simply increasing the redundancy. The cost of this replication can be high. The ideal system performance of N processors is N times the performance of one processor. In an NMR system, though, all the processors are performing the same task, so the throughput is the same as for one processor. In fact, the performance will be worse than that of one processor because some overhead will be associated with the voting, thereby reducing the system throughput. The added reliability is exchanged for increased system cost and decreased throughput. Some applications require extremely reliable systems, so the only option is NMR. Many modern computer systems use some type of replication to increase reliability.

# Processes = 3 × (# Nodes + # Arcs)

Figure 1-1: Non-redundant Four Module System and Associated TMR System

In N-modular redundancy (NMR), the redundant modules are often computer-memory pairs. The computers communicate information to be voted on either by hardware voters [11], or by software voters running on the processors [4] [9]. Software voting has a number of distinct advantages over hardware voting, one of which is the flexibility of the voter. For example, the Software Implemented Fault Tolerant computer (SIFT) [3] has a voter that can handle a 5-way vote or a 3-way vote. The system can determine which voter to use depending on the number of processors available. The software voter routine can be modified as the system changes in order to improve the system reliability. Other reliability improvement features such as dynamic reconfiguration can be easily implemented in software, and have been shown to improve system reliability [6] [14]. Most of the research on NMR has made the assumption that the modules are synchronized [1]. Since it is very difficult to force processors to be tightly synchronized, this assumption does not hold for a large class of systems. Some researchers are beginning to realize that asynchronous systems offer distinct advantages in reliability [9] and simplicity. The problem remains, though, of how to design an asynchronous system that meets the reliability objectives.


A general purpose multiprocessor called Cm* was used to experiment with NMR computer systems [5]. Cm* has an operating system named Medusa that provides primitives for experimentation, and an experimental interactive synthetic workload generator that provides an environment conducive to monitoring Cm* performance. Cm* is a 50 processor (DEC LSI-11s) multiprocessor connected by a hierarchical, distributed switching structure. Each processor is connected to a local memory to form a Computer module called a Cm. The processor is connected to the local memory by a switch called an Slocal. The Cms are connected together into clusters by a high speed bus. A high speed microprogrammable bus controller, a Kmap, controls access to this packet switched bus, as well as providing access to other Kmaps and their associated clusters. The Kmaps are each connected to other Kmaps by two intercluster busses. All Cms can access all the memory in Cm* through the Slocals and the Kmaps. The memory reference hierarchy consists of local references, intracluster references, and intercluster references.

The Medusa Operating System is a message based, object oriented operating system designed to exploit the architecture of Cm* [10]. It was designed with modularity, robustness, and performance in mind. The functions of the operating system are partitioned into Task Forces, which are sets of closely cooperating parallel processes. The processes can communicate via messages passed through a communication medium called a pipe. Medusa provides for Conditional and Unconditional Sends and Receives. All data in Medusa is stored in system defined objects. Objects can be accessed through private or shared descriptors. The Kmap and the Slocal cooperate to convert a descriptor into a physical address. Operating system functions such as message communication, address mapping, interrupt handling, activity multiplexing, and mutual exclusion are performed by the Kmap microcode.

The synthetic workload generator (SWG) [13] is a tool running under Medusa that provides a controllable, interactive experimentation environment for Cm*. The SWG provides a user interface that allows interactive experimentation. The SWG allows the user to vary experimental parameters at runtime, so that experimental data can be collected easily. All experiments for the SWG are represented as a data flow graph. Processes are represented as nodes of the graph. The communication of information is over arcs on the graph. Buffers are used to store messages being passed between nodes. The synthetic workload is made up of repetitions of operations from a library of actions. The actions are designed to simulate real operations. Various control structures supported by the SWG facilitate the starting and stopping of experiments. The SWG and the data flow model were used in the experiments described in Sections 4 and 5.

This paper is divided into six sections. Section 1 provided an overview of the field of highly reliable systems, a justification for the work presented, and an overview of the research vehicle and tools used during the research. Section 2 introduces the concepts involved in voting, including synchronization issues, voting frequency, and voted data. Section 3 presents the experimental paradigm. The types of voters used in the experiments are also described. A series of experiments that describe the voting overhead in a TMR software voting system are presented in Section 4. A theoretical framework is developed for a voting overhead model, and experimental results are compared to results predicted by the model. Section 5 presents experiments designed to explore how closely synchronized voting systems must remain. Variation in process execution speed is used to determine how much asynchrony is acceptable in NMR systems. The results of the experiments yield some guidelines for designing asynchronous NMR systems. An analysis of the synchronization data is also presented. Equations are developed that can predict the amount of variation in process execution speed that is acceptable for reliable system operation. A queuing theory model is developed to describe the voter-subtask relationship. The results predicted by the model are compared to the actual experimental results.

2.  Voting  Concepts 

This section gives an overview of voting systems and presents the issues involved in voting. Triple Modular Redundancy (TMR) was first proposed in 1956 by von Neumann [15]. Since that time, TMR systems have been built and evaluated [2] [7] [14] [16] [17]. Techniques have been used to improve the reliability of TMR systems, and some of these techniques are presented below. In addition, some new concepts that relate particularly to software voting are presented.

The design of redundant systems is intended to improve their reliability by replicating a module N times, and comparing the outputs of the N modules. The comparison should take the N module outputs, and choose the most likely output as the actual output. The comparison has taken many forms over the years, but a simple majority vote is the most popular. A majority ($\lfloor N/2 + 1 \rfloor$) of the modules must agree on a value for a particular output. Since most computer systems use a binary representation for the data, the voter simply needs to compare the data bit-by-bit.
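The majority rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the voter code used in the experiments: the function name `majority_vote` is hypothetical, and the outputs are assumed to be hashable values such as integers.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value that at least floor(N/2) + 1 outputs agree on, else None."""
    if not outputs:
        return None
    needed = len(outputs) // 2 + 1              # majority threshold for N modules
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= needed else None
```

For N = 3 the threshold is two, so one faulty output is masked; for an even N, a tie yields no majority.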

In an NMR system, if only one voter is used to determine the correct module output, then the failure of the voter becomes a catastrophic event. The voter is called a single point of failure. If the voter, however, is also replicated N times then the single point of failure has been removed. The systems considered in this report are all NMR with no single points of failure, generally with N = 3 (TMR). Systems that are TMR with no single points of failure can mask a single permanent, intermittent, or transient error in either the voters or the modules.

If the modules to be replicated are software modules, then each module can execute on its own processor, concurrent with the execution of other modules. The replicated modules that are executing the same task are not necessarily executing at the same time. The replicated modules will have completely separate code and data, so they can be called space redundant. In addition, the modules can execute at different times, which is called time skew redundancy. If the system uses time skew redundancy, then the system may be able to tolerate multiple failures at one time, since the failures will affect different computational tasks. In a TMR system, three simultaneous failures could be tolerated if the system was both time skew and space redundant, and no more errors occur until the voters correct the three faults.

In order to vote on the outputs of modules, the voters must have some knowledge of when the outputs become valid. Since the modules may have different clocks, the voters must be able to wait for modules to prepare outputs before voting on them. Even if the modules have the same clock (which would be a single point of failure, so should probably be avoided), clock skew and differences in logic delay would introduce the need for the voters to wait for the outputs to all become valid. The wait time could be implicit, as in SIFT, such that the vote occurs at a predetermined time (if the module cannot produce the output in time, then the vote proceeds without that output). Conversely, the wait can be explicit, as in the Cm* voting experiments presented in Sections 4 and 5, such that the voter waits for a signal from the module indicating the output's validity. In the case of explicit waiting, the voter should not wait indefinitely for the module to signal, since the module may fail in such a way as to never produce the signal. The voter should, in this case, have a time-out to prevent indefinite waiting. Two types of time-outs are possible. A module external to the voter could interrupt the voter after a period of time. This requires a clock to determine the time, so is called a clock driven time-out. The second possibility is an event-driven time-out. A number of possible events could trigger a time-out, but in the experiments in Sections 4 and 5 the time-out occurs after the voter receives n messages from one module without receiving any messages from another module.
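One way to read the event-driven time-out rule is sketched below in Python. It is an interpretation under stated assumptions rather than the mechanism used on Cm*: the class name `EventDrivenTimeout`, its method `on_message`, and the choice to count messages per pair of modules are all hypothetical.

```python
class EventDrivenTimeout:
    """Flags a module as timed out once another module has sent n messages
    while the silent module has sent none (no clock is involved)."""

    def __init__(self, modules, n):
        self.n = n
        # since[a][b] = messages received from a since the last message from b
        self.since = {a: {b: 0 for b in modules if b != a} for a in modules}

    def on_message(self, sender):
        """Record a message from `sender`; return the set of modules now timed out."""
        # The sender has spoken, so counts held against it start over.
        for other, counts in self.since.items():
            if other != sender:
                counts[sender] = 0
        timed_out = set()
        for silent in self.since[sender]:
            self.since[sender][silent] += 1
            if self.since[sender][silent] >= self.n:
                timed_out.add(silent)
        return timed_out
```

Because the trigger is a message count, the time-out scales with the speed of the working modules instead of with wall-clock time.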

When the module outputs become valid a voter can determine the majority, and generate its own output called the voted output. The point in time when the module outputs all become valid is called a point of synchronization, since the system will be synchronized with respect to the module outputs at this point in time. The voter must wait for at least a majority of the outputs before it can decide on the correct voted output, so at least a majority of the modules must reach the synchronization point before the vote. If the voter does not wait for all the modules to generate outputs, but only a majority, then this is called a point of partial synchronization.

The amount of work done between votes can be small (a few instructions) or large (thousands of instructions). The trade-off in determining the voting frequency is throughput versus reliability. As the frequency of voting is increased, the overhead due to voting becomes greater. This decreases the throughput, but will increase the reliability. In general, a TMR system can tolerate one error between votes. However, there is a probability that two errors will occur between votes. The assumption will be made that the system cannot recover from two such errors. Given an error rate, the system should vote frequently enough so that no two errors arrive between votes. A task to be performed will take longer to execute as the granularity of voting is decreased, due to the overhead introduced by each vote. The probability of two errors occurring between votes will decrease as the granularity decreases, until the voter execution time dominates the total execution time. Any further decrease in granularity will have little effect on the probability of two errors occurring between votes, but the total task execution time will continue to increase. Therefore, the probability of a system failure sometime during the task execution will increase. As the voter takes a larger percentage of the total execution time, the voter becomes the module that is more likely to fail. The system reliability decreases if the granularity is decreased past this point.

There are many issues involved in choosing the amount and kind of data to be voted on by the voter. One of the first decisions made in designing an NMR system is to choose the data to be voted on. Systems can be designed that would vote on the actual data used in a module. The actual data would include processor state that is unimportant to the value of the outputs. For example, if a program is relocatable, then the program counter may be different for each processor. The results produced by the program will, however, be identical. In a system that votes on actual data, the programs being executed must all be placed in the same memory space, and the programs have no flexibility in independently choosing any parameters. A more flexible system might allow modules to act independently, only voting on the parameters that affect the outputs.

Once the data has been passed to the voter, the voter has some options on how to determine the majority. The voter could choose to compare bits, words, or an entire array. The type of data to be compared is called the data granularity. The choice of data granularity makes a difference in system reliability. If the data is voted on bit-by-bit, it is guaranteed that a majority will be found. There are only two possible values and one will be the majority. If a majority of the bits are in error then the voted value will be incorrect. If a larger data granularity is chosen, for example an n-bit word, then the voter can reach three decisions. All three can agree on the value, two can agree on the value, or all three can disagree. In this case, the detectability of errors is improved, since the probability of having two incorrect words that agree is less than that of having two incorrect bits that agree. Word voting is less likely to produce an incorrect answer which may cause catastrophic errors in other modules. The voter can detect when all three disagree, and a recovery routine can decide how to handle the faults. Even though the voter provides no answer, this is preferable to providing the wrong answer. If the data granularity is increased again, then the probability that two incorrect data values agree is decreased. If an entire array is compared to two other arrays, the probability of having two faulty but equal arrays is smaller than the probability of having two faulty but equal words. The array could contain two correctable errors, yet the voter would not correct either because the data granularity is large. The ideal value of the data granularity should be when the probability of having two correctable errors in the data equals the probability of having two incorrect data values agree. A small data granularity allows the voter to correct many errors, and a large granularity reduces the probability of allowing incorrect data to pass the voter. A voter could obtain better detectability and correctability by using a small data granularity to correct errors and a large data granularity to detect errors. This voter would, however, have a greater execution time than a simple voter.
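The effect of data granularity can be illustrated with a small Python sketch. It is a hypothetical example, not taken from the experiments: the helper names `vote3`, `vote_by_word`, and `vote_by_array` are assumptions, and the arrays are plain lists of integers.

```python
def vote3(a, b, c):
    """Three-way comparison vote; returns None when all three disagree."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None

def vote_by_word(xs, ys, zs):
    """Small granularity: vote on each word position independently."""
    return [vote3(x, y, z) for x, y, z in zip(xs, ys, zs)]

def vote_by_array(xs, ys, zs):
    """Large granularity: vote on the arrays as indivisible values."""
    return vote3(xs, ys, zs)

good = [1, 2, 3, 4]
copy_a = [9, 2, 3, 4]   # one faulty word at position 0
copy_b = [1, 2, 8, 4]   # one faulty word at position 2
word_result = vote_by_word(copy_a, copy_b, good)
array_result = vote_by_array(copy_a, copy_b, good)
```

Here the word-level vote corrects both errors, while the array-level vote finds no two equal arrays and can only report that a fault occurred.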

Generally, software voters do not vote bit-by-bit, since processors are designed to handle bytes or words better than bits. If the three words passed to the voter are X, Y, and Z, then the combinatorial majority vote is defined as:

$$V = X \cdot Y + X \cdot Z + Y \cdot Z$$

If the values of X, Y, and Z are words, then a bit-wise vote will proceed in parallel for all n bits in the word. The generation of the voted data value with this method takes just three bit-wise AND operations and two bit-wise OR operations. The comparison voter that is popular in many software voting systems requires at least three comparisons and two branches. The combinatorial majority voter has straight in-line code that could be pipelined on a special purpose machine to improve performance, whereas the comparison voter cannot be pipelined. The combinatorial majority voter therefore requires less execution time than the classical comparison voter, and increases the probability of correcting independent errors.
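The two voters can be compared side by side in a short Python sketch. The function names are hypothetical and the words are assumed to be non-negative integers; the code illustrates the definitions above and is not the implementation measured in the experiments.

```python
def combinatorial_vote(x, y, z):
    """Bit-wise majority of three words: three ANDs and two ORs, no branches."""
    return (x & y) | (x & z) | (y & z)

def comparison_vote(x, y, z):
    """Word-level majority by comparison; returns None if all three disagree."""
    if x == y or x == z:
        return x
    if y == z:
        return y
    return None

correct = 0b1100
x = 0b1101   # bit 0 flipped
y = 0b1000   # bit 2 flipped
z = correct
bitwise_result = combinatorial_vote(x, y, z)
compare_result = comparison_vote(x, y, z)
```

With two independent single-bit errors in different positions, the bit-wise vote still recovers the correct word, while the comparison voter finds no two equal words.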

In addition to choosing the data granularity, other parameters of the data must be chosen. It may be desirable to vote on some abstract data structures, to determine if the data they contain is equal. Some interesting problems arise, due to the nature of some data structures. For example, a linked list data structure may be passed to a voter by three modules. The voter should vote on the data in the linked list, but should not vote on pointers to data items. The lists should have the same structure and the same data, but not necessarily the same pointers. This procedure requires an intelligent voter, with knowledge of linked lists and of the storage format. Other interesting data structures, such as queues or stacks, could be used as inputs to the voters. Abstract data structures are commonly used in high level programming languages, so the voters should be able to handle them. An NMR system should attempt to accommodate the programmer, not the other way around. Although no systems provide abstract voting yet, as more applications are written for NMR systems, the programmers are going to discover the advantages of having voters that can handle abstract data.
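A voter with knowledge of linked lists might look like the following Python sketch. Everything here is a hypothetical illustration, since the text notes that no system provided abstract voting: the `Node` class and the helpers `from_values`, `contents`, and `vote_lists` are assumptions. The voter compares the sequence of data items and ignores the node references, which play the role of pointers.

```python
class Node:
    def __init__(self, data, next_node=None):
        self.data = data
        self.next_node = next_node

def from_values(values):
    """Build a linked list holding `values` in order; returns the head node."""
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head

def contents(head):
    """Walk the list and collect the data items, ignoring node identities."""
    out = []
    while head is not None:
        out.append(head.data)
        head = head.next_node
    return out

def vote_lists(h1, h2, h3):
    """Majority vote on list contents; returns None if all three disagree."""
    a, b, c = contents(h1), contents(h2), contents(h3)
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None
```

Three lists built independently have different node objects, yet they vote as equal when their structure and data match.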

3.  Experimental  Paradigm 

The two types of experiments performed use a similar paradigm. The paradigm can be viewed at the highest level as the execution of a single task. The task to be performed is broken into equal subtasks. Each subtask is executed in order, with data being passed from one subtask to the next. It is assumed that each subtask has the exact same execution speed, and that only one word of data is passed from one subtask to the next. Since the subtasks all have the same execution speed, the task can be simulated by a loop that executes n times with a synthetic workload that takes subtask time inside the loop. Figure 3-1 shows the partitioning. Each subtask is triplicated, and a vote occurs on the data passed between subtasks, yielding the structure in Figure 3-2.

Figure 3-1: Experiment Task Partitioning

Figure 3-2: TMR Experimental Structure


The triplicated subtasks all perform the same function. They will calculate the $i^{th}$ data value, send a copy of the data to each voter, and receive the voted value of the data from the associated voter. The new data value is then used in calculating the $(i+1)^{th}$ data value.¹ The time each subtask takes to calculate the data value is an experimental variable. All of the triplicated subtasks will have a variable execution time. This time is set by the granularity of the subtask, which is defined as the number of operations executed between votes not including the overhead due to voting. An operation is four LSI-11 instructions. The granularity of each subtask can be set before an experiment.

The voter subtask is also triplicated, as shown in Figure 3-2. Each subtask sends each voter two data words. The first data word is a sequence number to associate data with an iteration. The second word is the data to be compared by the voter. When a voter has received data from a majority of the subtasks (two), it checks to see if the data values agree. If so, then a majority vote has been achieved, and the data value is sent to the subtask associated with this voter. If they do not agree, then the voter waits for the data value from the third subtask to determine the correct value, which is sent to the associated subtask. Each voter and subtask is assigned its own processor, so each voter proceeds with the voting in parallel with the subtask execution.
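The decision rule described above can be sketched in Python as follows. This is a minimal illustration, assuming the data words for one iteration are collected in arrival order; the function name `vote_when_ready` and its `(decided, value)` return convention are hypothetical and are not taken from the voter algorithms in Appendix 1.

```python
def vote_when_ready(values):
    """Return (decided, voted_value) for the data words received so far.

    `values` holds the words for one iteration in arrival order (at most three).
    """
    if len(values) >= 2 and values[0] == values[1]:
        return True, values[0]          # first two agree: no need to wait
    if len(values) == 3:
        if values[2] in (values[0], values[1]):
            return True, values[2]      # third word settles the vote
        return True, None               # all three disagree: no majority
    return False, None                  # keep waiting for more data
```

When the first two words agree, the voter can release the voted value without waiting for the slowest subtask.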

Three types of voters are used in the experiments. The first voter, called the simple voter, is a synchronizing voter. It requires the subtasks to reach a full point of synchronization after each subtask iteration. It has no internal storage of data from one iteration to the next. The second voter, called the internal queue voter, has an internal queue that allows it to handle data from different iterations. The subtasks are not required to fully synchronize after each iteration. This voter has been optimized for high execution speed in the average case and therefore has the shortest execution time. The third voter, called the sequence number voter, uses the sequence numbers that are sent by the subtasks, so that the voter can order data based on the subtask iteration. This voter has the longest execution time. All three voters were designed to allow easy expansion to N-way voting. The algorithms for the voters are presented in Appendix 1.

As long as the subtasks have similar execution speeds, the voter should receive the iteration from each subtask at approximately the same time. The sequence number voter and the internal queue voter do not require a full point of synchronization, so if one subtask is slower than the other two then the voter may receive the $(i+1)^{th}$ data value from a fast subtask before the slow subtask sends the $i^{th}$ data value. Since the voter now has data from two different iterations, it must be able to distinguish which data is associated with which iteration, and from which subtask. A voter queue is used to maintain this database. Each row in the queue contains information about:

1. which iteration this row represents.

2. whether data has arrived from each source subtask.

3. what the data value is from a source subtask (if it has arrived).

The column the data is stored in implicitly identifies the associated destination subtask.

¹ One can imagine wanting to pass more than one data value from one subtask to the next. This can be done with a more complicated voter. The entire state of a processor (or selected parts) could be passed as data, allowing a faulty processor to recover from a transient by accepting the voted state as its new state. Adding this capability to the experiments would complicate them without yielding additional information about the voting.

The sequence number voter then searches for an iteration number in the voter queue to find the row where the data for this subtask belongs. If the iteration number is not found in the queue, a row for this iteration is placed in the queue and the data is placed in the row. When all of the data values for a particular row have arrived, the voter reports any errors found while voting and then removes the row from the queue.

The voter queue has a finite maximum length. If one subtask has not sent any data to the voter in the same period in which the other two subtasks have sent many data messages, the voter queue could conceivably become full. The voter handles a full queue by removing the oldest row (associated with an iteration for which all the data has not arrived) from the queue and adding a row associated with the new iteration number. Errors are reported on the row removed from the queue. The maximum length of the queue can be large, so that the queue will never become full in experiments.
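The voter queue described in this section can be sketched in Python as shown below. This is a simplified illustration under stated assumptions, not the Appendix 1 algorithm: the class name `VoterQueue`, the `max_rows` default, and the error strings are hypothetical, "oldest" is taken to mean the row that was created first, and the voted value is returned only when a row is complete.

```python
from collections import Counter, OrderedDict

_MISSING = object()  # marks a column whose data word has not arrived yet

class VoterQueue:
    def __init__(self, sources=3, max_rows=8):
        self.sources = sources
        self.max_rows = max_rows
        self.rows = OrderedDict()  # iteration number -> one slot per source subtask
        self.errors = []           # (iteration, description) pairs

    def receive(self, iteration, source, value):
        """Store one data word; return the voted value when its row is complete."""
        if iteration not in self.rows:
            if len(self.rows) >= self.max_rows:
                oldest, _ = self.rows.popitem(last=False)  # evict the oldest row
                self.errors.append((oldest, "evicted before all data arrived"))
            self.rows[iteration] = [_MISSING] * self.sources
        row = self.rows[iteration]
        row[source] = value
        if any(v is _MISSING for v in row):
            return None                                    # row still incomplete
        del self.rows[iteration]
        winner, count = Counter(row).most_common(1)[0]
        if count < self.sources:
            self.errors.append((iteration, "sources disagreed"))
        return winner if count >= self.sources // 2 + 1 else None
```

Keying the rows by iteration number lets a fast subtask deposit data for a later iteration while an earlier row is still waiting for the slow subtask.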

4.  Voter  Overhead  Experiments 

In any N-modular redundancy (NMR) system, the amount of useful work done will be less than in the corresponding non-replicated system. The voting that is done in an NMR system will introduce some overhead that will reduce the system throughput. The overhead will be made up of many different components, including the communication time between modules and the time required by the voters to receive messages and find the majority. In this section, voting overhead is discussed and a model is developed to describe voting overhead.

In order to develop a model for voting overhead one must determine a method for representing overhead, and must determine what parameters affect the overhead. One possible representation for overhead is in additional operations executed per unit time (operations/second) due to redundancy. The actual throughput is the number of subtask operations performed per unit time. As the actual throughput goes down, the overhead goes up. Mathematically:

$$\text{Overhead} = \text{Non-redundant Throughput} - \text{Actual Throughput} \tag{1}$$

In this section, the actual throughput is determined, and the overhead can be calculated from the above equation. The non-redundant throughput is a constant for a given system.

Each subtask is executing an instruction sequence iteratively. Since each iteration is identical, the total overhead is the number of iterations times the overhead for one iteration. A subtask performs work for each iteration and the amount of work is called the granularity, $G$. Since the total amount of work to be performed is a constant, $W$, then the number of subtask iterations, $I$, is:

$$I = \frac{W}{G}, \quad \text{or} \quad W = IG \tag{2}$$

In other words, if the total work is 100 units, and 5 units are performed per iteration, then 20 iterations must be performed.

As an experiment is performed, the total execution time is measured. The execution time, $t_E$, is the time
from when a subtask begins the first iteration until the subtask finishes the last iteration. The throughput, T,
therefore is:

$T = W / t_E$   (3)

The total time, $t_E$, can be expressed as:

$t_E = t_{i-ave} \times \text{Total number of instructions}$   (4)

where $t_{i-ave}$ is the average instruction execution time, and

$\text{Total number of instructions} = I \, (a G + k)$   (5)

where a is the number of instructions executed by a subtask when G = 1 and k is the total overhead per
iteration, including voting. Therefore from Equations 4 and 5

$t_E = t_{i-ave} \, I \, (a G + k)$   (6)

From Equations 2, 3, and 6,

$T = \frac{W}{t_{i-ave} \, I \, (a G + k)} = \frac{1}{t_{i-ave} \, (a + k/G)}$   (7)

The throughput, then, is inversely proportional to the average instruction execution time and to the sum of
the number of instructions per unit of work (a) and the number of overhead instructions over the granularity
(k/G). The values of k, $t_{i-ave}$, and a are experimental constants, so we can plot the throughput versus the
granularity. For typical values of k, $t_{i-ave}$, and a (k = 800, $t_{i-ave}$ = 6.5 µs, a = 4), the curve is shown in
Figure 4-1.
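
As a rough illustration, Equation 7 can be evaluated directly. The sketch below uses the typical constants quoted above as defaults; the function names are hypothetical, and treating the k = 0 limit as the non-redundant throughput in Equation 1 is an assumption made only for this example.

```python
def throughput(granularity, k=800.0, t_i_ave=6.5e-6, a=4.0):
    """Equation 7: operations per second at granularity G.

    k       -- overhead instructions per iteration
    t_i_ave -- average instruction execution time in seconds
    a       -- instructions executed per unit of work (G = 1)
    """
    return 1.0 / (t_i_ave * (a + k / granularity))


def overhead(granularity, k=800.0, t_i_ave=6.5e-6, a=4.0):
    """Equation 1, assuming the non-redundant throughput is the k = 0 limit."""
    non_redundant = 1.0 / (t_i_ave * a)
    return non_redundant - throughput(granularity, k, t_i_ave, a)


if __name__ == "__main__":
    for g in (1, 16, 256, 1024, 16384):
        print(f"G={g:6d}  T={throughput(g):10.1f} ops/s  overhead={overhead(g):10.1f} ops/s")
```

The printed values rise toward the k = 0 limit as the granularity grows, which is the general shape described for the predicted curve.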

The  previous  overhead  model  is  both  general  and  accurate.  Although  selection  of  the  value  of  k  (the 
subtask  overhead  per  iteration)  is  difficult,  a  careful  approximation  to  k  can  be  found. 

Figure 4-1: Predicted Voting Overhead

Cm* was used as the experimental vehicle. The Voter and Subtask software routines were triplicated and
each placed on its own processor. The number of iterations, I, and the granularity, G, were varied during
the experiments. The execution time, $t_E$, was recorded for each set of values of I and G. The total work done,
W, was kept constant by choosing a value of I and calculating the value of G. The value of W was chosen to be
16,384 operations. The throughput was calculated for each execution time. All three types of voters described
previously were used in this experiment. From the overhead model it can be seen that changing the type of
voter should only affect the value of k in Equation 7. The throughput versus the granularity is plotted for
various voter changes in Figure 4-2. Even when the voter is changed significantly, the change in throughput
seems to be small.

Figure 4-2: Actual Voting Overhead for Various Voters

The model is extremely accurate in predicting the overhead in a system. One problem with the model,
hinted at earlier, is the difficulty in finding values for the constant k. The value should be predictable by
adding the instruction execution times in the Subtask and the Voter, but some of the instructions used do not
have predictable execution times due to factors like system load. In addition, some of the voting and subtask
execution are performed in parallel, so instruction counts would give an upper bound on k, but not an
accurate value. The amount of parallelism is difficult to quantify without seriously perturbing the
experiment. Therefore, the value of k used in Figure 4-1 was estimated using experimental results. The
comparison of the predicted and actual curves, however, loses credibility since the values of k for the
predicted curves must be experimentally determined.
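
One way to estimate k from a measured run is to solve Equation 6 for k. The sketch below is only an illustration of that algebra, not the procedure used in the experiments; the function name and the default constants (6.5 µs per instruction, a = 4) are assumptions taken from the typical values quoted for Figure 4-1.

```python
def estimate_k(t_e, iterations, granularity, t_i_ave=6.5e-6, a=4.0):
    """Solve t_E = t_i_ave * I * (a*G + k) for k.

    t_e         -- measured execution time in seconds
    iterations  -- number of subtask iterations, I
    granularity -- work per iteration, G
    """
    return t_e / (t_i_ave * iterations) - a * granularity


if __name__ == "__main__":
    # Synthetic check: build an execution time from a known k, then recover it.
    I, G, k_true = 16, 1024, 500.0
    t_e = 6.5e-6 * I * (4.0 * G + k_true)
    print(estimate_k(t_e, I, G))
```

Because the estimate comes from the same measurement the model is later compared against, it inherits the circularity noted above.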

The value of k can be given an upper bound for the non-error case. The upper bound will change as the
voter changes, but for any given experiment the upper bound can be determined. For the optimized voter
and subtask experiment, this upper bound has been found by adding the instruction execution times for the
subtask overhead and the voter time. The actual value of k will be less than this time because the voter will be
executing simultaneously with the subtask. An upper bound on k is approximated by:

$k_{max} = k_{s-max} + k_{v-max}$

where $k_{s-max}$ is the maximum subtask contribution to k and $k_{v-max}$ is the maximum voter contribution to k.

By analyzing the programs written for the experiments, it is found that:

$k_{s-max}$ = [illegible] instructions + 3 Sends + 1 Receive

$k_{v-max}$ = 237 instructions + 3 Conditional Receives + 1 Send

The execution times for sends and receives on Cm* Medusa are given in [12]. The average execution time for
LSI-11 instructions in the voter and the subtask was determined to be 6.5 µs. Using this information, $k_{max}$ is
determined.

$k_{s-max} \approx 333$ LSI-11 instructions

$k_{v-max} \approx 471$ LSI-11 instructions

$k < 333 + 471 = 804$ LSI-11 instructions   (8)

Similarly, the lower bound can be approximated by:

$k_{s-min}$ = 68 instructions + 3 Sends + 1 Receive

$k_{v-min}$ = 127 instructions + 2 Conditional Receives + 1 Send

$k_{min} \approx$ [illegible]

$k > 333$ LSI-11 instructions   (9)

Equation 9 assumes maximum simultaneous execution of the subtask and the voter. The experiments with
the optimized voter yielded values of k between 350 and 712. These experimental results fall between the
minimum and maximum theoretical values calculated above. The bounds should be recalculated if the voter
or subtask is changed. Figure 4-3 compares the minimum and maximum predicted curves, and an
experimental curve (for the optimized voter). One result that the model does not take into account is that the
value of k changes as the granularity changes. During the optimized voter experiment, the value of k varied
by over 350 instructions. This is due to the change in load on the Kmap processors as the granularity
changes. The model assumes that the value of k stays constant throughout the experiment. In spite of these
deficiencies, the overhead model does give accurate predictions of expected voting overhead.
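
The two bounds on k translate into an envelope of throughput curves through Equation 7. The following sketch computes that envelope under the assumption that $t_{i-ave}$ = 6.5 µs and a = 4 (the typical values quoted earlier); the names `K_MIN`, `K_MAX`, `throughput_bounds`, and `k_within_bounds` are hypothetical.

```python
K_MIN = 333.0  # lower bound on k from Equation 9, in LSI-11 instructions
K_MAX = 804.0  # upper bound on k from Equation 8, in LSI-11 instructions


def throughput(granularity, k, t_i_ave=6.5e-6, a=4.0):
    """Equation 7: operations per second at granularity G."""
    return 1.0 / (t_i_ave * (a + k / granularity))


def throughput_bounds(granularity):
    """Return (lowest, highest) predicted throughput for k in [K_MIN, K_MAX]."""
    # A larger k means more overhead, hence the lower throughput.
    return throughput(granularity, K_MAX), throughput(granularity, K_MIN)


def k_within_bounds(k):
    """True if an experimentally estimated k lies between the two bounds."""
    return K_MIN < k < K_MAX


if __name__ == "__main__":
    for g in (16, 256, 1024, 16384):
        low, high = throughput_bounds(g)
        print(f"G={g:6d}  {low:10.1f} <= T <= {high:10.1f} ops/s")
```

The envelope narrows as the granularity grows, since k/G becomes small relative to a.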

5.  Voter  Queue  Length  Experiments 

In an asynchronous NMR computer system, the processors will have their own clocks and will make little
or no effort to synchronize the clocks with each other. The random variation in clock speed and the
difference in process execution patterns will cause differences in the arrival times of the data to be voted on
by the voters. The voters should be able to receive data asynchronously so that they can vote on the data
when a majority of the processes have sent it. The voters must be able to store message values so that one
processor can be calculating the 10th step in a procedure while another processor can be working on the 12th
step. Eventually both processors should finish the procedure, but as long as no data dependencies exist, one
processor should not be forced to wait for another to finish a calculation. Even when data dependencies do
exist, when a majority of the processors agree on the value of a step, there is no reason to wait for the rest of
the processors to finish before continuing with the next step. In fact, waiting can reduce reliability if a
processor is faulty, since it may never respond to the voter. There should, however, be a limit to the amount a
processor should be allowed to fall behind before it is considered faulty. The random variation may cause
problems if one processor becomes hopelessly behind due to the variation. Experiments have been
performed to discover how variations in process execution speed affect the amount a process
falls behind the others. The effects of variation in process execution speed, as well as variation of the number
of instructions executed between votes, have been examined.

Figure 4-3: Comparison of Actual and Predicted Voting Overhead

Three experiments have been performed. Each is designed to explore a different area of the
synchronization problem. Experiment one has a single process execute more instructions for every step in the
experiment. This process is continuously slower. This experiment shows that the voter overhead increases as
the slow process falls behind. Experiment two has one process slower for a period, followed by being faster
for a period. Experiment three has one process slower for a period, followed by a period of normal speed.
The third experiment is realistic for many systems, since processes are likely to fall behind in a system but are
not likely to speed up.

5.1. Experiment One

The first experiment performed was designed to measure the ability of the voter to synchronize the
subtasks when one subtask is continuously slower than the other subtasks. The frequency of voting (or
granularity of the subtasks) was varied, and the execution speed of one subtask was varied. The queue lengths
of the voters were recorded as a measure of how far the slow subtask fell behind the two faster subtasks. The
slower subtask performed 10% to 50% more operations in calculating the next value. The slower subtask
represents a process that requires more execution time due to an instruction retry, or due to an interrupt that
it must handle. In these situations, one subtask will be temporarily slower; but as these experiments show, it
would be ill-advised to design a system where one subtask was continuously slower (this experiment shows
design constraints for systems that have one continuously slower subtask). Each voter recorded the length of
the voter queue every time a new iteration was received. The queue length information was sent as a message
to a process that stored the data in a file. The recording of the queue length added some overhead to the
voter, but each voter paid the same cost.

The queue length was plotted versus the iteration number for two different granularities and various
subtask degradations, as shown in Figures 5-1 and 5-2. For granularity equal to 1024 operations, one subtask
can be up to 10% slower and the queue length stays at one. This implies that the voter overhead is great
enough so that the differences in speed are masked. For larger differences in speed, the queue length grows
to a value and then levels off. The queue length is bounded due to an increase in voter execution time as the
queue length increases. The voter must search for the iteration number in the queue and the search proceeds
linearly. The subtask that is slower will not pay this overhead cost since it has n-1 messages waiting for
processing, where n is the queue length.

As the granularity increases, the queue length grows more rapidly. With granularity equal to 1024 (Figure
5-1), the 10% to 40% additional operations curves appear to be bounded but the 50% additional operations
curve is not bounded. The curves for granularity equal to 16,384 (Figure 5-2) do not appear to have a
bounded queue length. This is due to the fact that the voter overhead takes a smaller percentage of the total
execution time for the larger granularity cases. The voter overhead is a fixed value for a specific queue length.
When the slower subtask takes approximately the same amount of time as the voter, then the voter overhead
is significant in comparison to the subtask execution time. While the normal subtasks are waiting for the
voter to generate a voted data value, the slower subtask can be calculating a data value for one of the old
messages (when the queue length is greater than one, the slower subtask will have data values to calculate for
all the messages in the queue).


Figure 5-1: Granularity equal to 1024, one subtask always slower (Voter Queue Length vs Time, time in number of votes)

Figure 5-2: Granularity equal to 16,384, one subtask always slower (Voter Queue Length vs Time, time in number of votes)

5.2. Experiment Two

The second experiment is a variation on the first experiment and was designed to explore the synchronizing
nature of voters more fully. In this experiment, one subtask is slower than the other two subtasks by a
percentage for a period of time, then the same subtask is faster than the other subtasks for the same period.
The period was chosen to be 20 iterations. For example, subtask A will perform 10% more operations in
calculating the first 20 data values, followed by performing 10% fewer operations for the next 20 iterations.
Subtask A will therefore spend 10% more time executing the first 20 iterations than the second 20 iterations.

While the subtask is operating slower, the queue length should behave exactly the same as in experiment
one. Once the subtask is faster than the others, this subtask should quickly catch up, resulting in a decline in
the queue length. The rate of decline in queue length should be greater than the rate of increase, since when
the queue has length greater than one the subtask being varied does not have to wait for the voter to finish
before beginning the next data value calculation.

The first plot of queue length versus iteration number with granularity equal to 1024 (Figure 5-3) shows the
expected result. The queue length increases when subtask A is slower and the rate of increase is the same as
that from experiment one. As soon as subtask A begins executing fewer operations per iteration, the queue
length declines rapidly, reaching queue length equal to one. If the granularity is increased to 16,384 (Figure
5-4) then the queue length is not restored to one, and there is a net increase in the queue length over time.
The queue length increases because subtask A will be spending more time executing the long calculations.

5.3.  Experiment  Three 

The third experiment is similar to experiment two, except it represents a more realistic class of
synchronization problems. A subtask that is performing a calculation may experience a temporary slowdown,
followed by a period of normal behavior, such as a subtask which has to perform a recovery routine because of
a bus error or has to perform a one-time operating system task. Is the processor running the subtask doomed
to stay behind, or will it eventually catch up even though it always takes as long to calculate a new data value
as the others? As soon as a subtask falls behind, it no longer pays the overhead cost since it has messages
queued up waiting for processing. This fact would imply that a subtask can catch up, and the rate at which it
catches up is the incremental voter overhead cost per iteration.

The experiment can be described as follows: one subtask will do additional operations (10% to 50%) for 20
iterations followed by a period of normal behavior (performing the same number of operations as the other
subtasks). The results of the experiment are shown in Figures 5-5 and 5-6. It can be seen that during the
periods of normal operation for all three subtasks, the queue length declines, and given a long enough period
of normal behavior would reach one. The rate of decline of queue length during normal subtask behavior
indicates the effect of voter overhead on the subtasks.

Figure 5-3: Granularity equal to 1024, one subtask slower half the time, faster half the time (Voter Queue Length vs Time; curves for 10% to 50% speed variation every 20 votes)

Figure 5-4: Granularity equal to 16,384, one subtask slower half the time, faster half the time (Voter Queue Length vs Time; curves for 30% and 40% speed variation every 20 votes)

Figure 5-6: Granularity equal to 16,384, one subtask slower half the time, same half the time (Voter Queue Length vs Time)

5.4. Experimental Conclusions

The three experiments performed give a clear picture of a synchronization model for the equal subtasks
paradigm. There appear to be two factors involved in the model. The factors are:

1. There is a minimum voter overhead that is due to the time required by the voter to receive a
message, handle the data, and vote on the data. The subtasks that have a queue length of one
must pay this overhead cost every iteration of the experiment.

2. The overhead cost increases as the voter queue length increases due to an increase in the data
handling cost. This factor would indicate that for a long enough queue, the voter could mask any
difference in process speed. For practical queue lengths, though, the increase in voter overhead
masks only some of the subtask speed variation.

The synchronization experiments can give some design principles for TMR asynchronous voting systems.
These principles can be applied to optimize the voter queue length, to choose a subtask granularity, and to
determine the amount of process speed variation allowed in a design. Proper application of the principles will
lead to a design that will have a bounded queue length for all possible variations in process execution rate.
The principles can be summarized as follows:

1. Smaller granularity subtasks have a higher probability of having a bounded queue length.

2. As subtask granularity increases, the random variation in process speed becomes increasingly
important in ensuring a bounded queue length.

3. Greater voter overhead allows a greater variation in process execution rate. This yields an
interesting trade-off in voter design, since a faster voter process will increase system throughput
but will decrease the amount of variation permitted in process execution rate.

These results can be generalized for synchronous voting, as well as asynchronous voting. If the maximum
voter queue length is fixed at one, then the system is synchronous like SIFT [2] [3] [4] and C.vmp [8] [11]. Both of
these NMR systems use a synchronous voter with queue length of one. C.vmp has a hardware voter with a
built-in wait feature. The length of the wait corresponds to the voter overhead in these experiments. SIFT
uses fixed scheduling, so a vote proceeds when the next time slot begins. The voter overhead corresponds to
the design margin in the fixed schedule (the time between the end of the process execution and the end of the
time slot).

5.5. Synchronization Modeling

This section will present a model of the voter queue length based on granularity, percent difference in
subtask execution speed, and time. The model is compared to the actual experimental results, but first the
relationship between the model and the TMR experiment should be explained.

5.5.1. Queue Length Models

The TMR system explained in the previous section has queues that contain the messages being passed
between the subtasks and the voters. Each voter has three queues in which to receive messages, and each
subtask has one queue in which to receive messages. The subtask message queue can be viewed in the light of
general queuing theory. The queue will have a birth rate, $\lambda$, and a death rate, $\mu$. Basic queuing theory
assumes that both $\lambda$ and $\mu$ are constant. Also, the birth rate must be less than the death rate so that the queue
length will be bounded. The servers of the queue have a utilization of $\lambda/\mu$. The utilization will be less than
one. There are two problems with using a simple queuing model for voter synchronization. They are that the
birth rate, $\lambda$, is not constant and that the birth rate is not less than the death rate for most of the experiments
performed (the queue length grows, therefore $\lambda$ is greater than $\mu$). In spite of these problems, a queuing
model can be developed.

Before a queuing model is presented, some background analysis of the previous section's data will be done.
Experiment one, in which one subtask was continually slower, will be used in developing the model of queue
behavior. Each experimental curve in the previous section begins to peak as time proceeds. The queue length
grows less rapidly as the queue length increases. The queue length appears to approach some bound that is
dependent on the granularity and the difference in execution speed. Some curves have observable bounds.
The information from all the experiment one curves presented could be summarized if this bound
information could be collected. If the queue had a maximum possible value, then each curve either remains
below the maximum or rises above the maximum. If a curve has a maximum value greater than the
maximum queue length, then the queue will overflow during the experiment, and is unbounded by this queue
length. Otherwise, the curve is bounded by the queue length. For three different maximum values of the
queue length, the bounded regions and unbounded regions are shown in Figure 5-7. In designing a system,
the maximum queue length can be chosen, and this will determine the acceptable granularities and subtask
execution speed differences to prevent the queue from overflowing. The curves that determine the regions
appear to be linear on the log versus log scale. That implies that:

$\log_2 \text{Granularity} + \log_2 \text{PercentDifference} = \text{constant}$

therefore,

$\text{Granularity} \times \text{PercentDifference} = \text{constant} = \text{VoterOverhead}$
This result indicates that for a given queue length, the granularity of the subtasks is inversely proportional to
the percent difference in processor speed. The constant is a number of operations which is dependent on the
voter overhead. A first approximation would equate this number of operations to the voter overhead for one
iteration. The voter overhead is constant along a boundary separating the bounded and unbounded regions.
A subtask can be constantly slower by a number of operations (the voter overhead) and still only fall some
constant number of iterations behind the other subtasks. Next, the value of the bound can be determined for
any granularity and percent difference.

Figure 5-7: Summary of Experiment One Data (Bounded and Unbounded, 1 Subtask Slower)

From the experimental data, the value of the voter overhead per iteration (the number of operations slower
one subtask may be and not fall further behind) can be plotted against the bound on the queue (the
maximum queue length). Figure 5-8 shows the data. A linear least squares fit was determined for the data.
This equation can predict the maximum queue length for a given granularity and percentage difference in
subtask speed. The equation is:

$\text{MaximumQueueLength} \; (M) = 0.0457 \, \text{VoterOverhead} - 4.0$   (10)

or

$\text{VoterOverhead} = 21.0 \, M + 92.2$   (11)

Figure 5-8: Maximum Queue Length

Equation 10 is fairly accurate in predicting the bound on the queue, but some of the variation in the data
remains unexplained. Equation 11 can predict how much variation in subtask execution speed is allowed
given a maximum queue length. Note that even when the maximum queue length is zero (a totally
synchronous voting system), some variation in subtask execution speed is allowed. In fact, this model
indicates that one subtask can be constantly 92 operations slower and never fall behind. This is the minimum
voter overhead, the time the voter takes to process two inputs. This overhead is one component of the value
of k presented in Section 4.
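
Equations 10 and 11 can be turned into a small design aid. The sketch below is illustrative only: the function names are hypothetical, and it assumes the percent difference is expressed as a fraction (0.10 for 10%), so that granularity times the fraction gives the number of operations by which the slow subtask lags per iteration. The two equations are used exactly as printed rather than forced to be algebraic inverses of each other.

```python
def voter_overhead(granularity, fraction_slower):
    """Operations per iteration by which the slow subtask lags (G x fraction)."""
    return granularity * fraction_slower


def max_queue_length(overhead_ops):
    """Equation 10: predicted bound on the voter queue length."""
    return 0.0457 * overhead_ops - 4.0


def allowed_overhead(max_len):
    """Equation 11: operations of lag tolerated for a given maximum queue length."""
    return 21.0 * max_len + 92.2


def allowed_fraction_slower(granularity, max_len):
    """Largest fractional slowdown that keeps the queue within max_len."""
    return allowed_overhead(max_len) / granularity


if __name__ == "__main__":
    for g in (1024, 16384):
        for pct in (0.10, 0.30, 0.50):
            m = max_queue_length(voter_overhead(g, pct))
            print(f"G={g:6d}  {pct:4.0%} slower  predicted bound M={m:7.1f}")
    print(allowed_fraction_slower(1024, 0))
```

For a granularity of 1024 and a maximum queue length of zero, the allowed slowdown works out to 92.2/1024, or roughly 9%, which is in line with the observation that a subtask up to 10% slower leaves the queue length at one.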

5.5.2.  Queuing  Theory  Model 

A subtask that has a slower execution rate than the two other subtasks will fall behind in executing each
subtask iteration. For every iteration the slow subtask is behind, the subtask queue will contain a message.
The queue length will grow as long as the subtask execution rate is greater than the voter execution rate. In
the previous section it was shown that the voter execution time is dependent on the length of the queue. In
fact, as the queue length grows, the voter takes longer to execute. This will result in a decreasing growth rate
for the subtask message queue. Now a model can be formalized.

$L$ = the queue length

$\lambda(L)$ = the birth rate, a function of the queue length

$\mu$ = the queue death rate

if $\lambda(L)/\mu > 1$ then the queue grows

when $\lambda(L)/\mu = 1$ then the queue length is in steady state

$\lambda(L) - \mu = g(L)$ = the growth rate of the queue length

$M$ = maximum queue length in steady state

and

$\alpha$ = percentage decrease in growth rate for each unit increase in $L$.

The maximum queue length equation was derived in the previous subsection. Using this, the value of $\alpha$
can be determined. At the start of each experiment, the length, $L$, will be zero. So,

$g(L) = \alpha M$ when $L = 0$

This model indicates that the initial growth rate is only dependent on the maximum queue length and a
constant percentage. The growth rate is simply the slope of the curve. Since the growth rate can be
experimentally determined, the value of $\alpha$ can be found. The growth rate when $L = 0$ was determined for a
number of the experimental curves. From this information, the value of $\alpha$ was determined to be:

$\alpha = \frac{16}{\text{Granularity}}$

This result has no known significance, but is accurate over all values of granularity and percent difference in
execution speed considered in the experiments.


Using the above results, the growth rate, which is simply the change in queue length over time, can be
written as:

$\frac{dL}{dt} = g(L) = \alpha (M - L) = \frac{16}{\text{Gran}} (M - L)$

The value of $L$ in the above equation is a function of time, so:

$\frac{dL(t)}{dt} + \frac{16}{\text{Gran}} L(t) = \frac{16 M}{\text{Gran}}, \qquad L(0) = 0$

The solution to this differential equation is:

$L(t) = M \left(1 - \exp\left(-\frac{16 t}{\text{Gran}}\right)\right)$

Since $M$ is experimentally known, the queue length can be plotted against time for various granularities
and percent differences in subtask execution speed.
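
The closed-form solution is easy to evaluate numerically. The sketch below is an illustration under stated assumptions, not the plotting code behind the figures: the function name is hypothetical, time is assumed to be measured in votes (the unit on the figure axes), and the example value of M is taken from Equation 10 with the percent difference expressed as a fraction.

```python
import math


def queue_length(t, granularity, max_len):
    """L(t) = M * (1 - exp(-16 t / Gran)), with L(0) = 0.

    t           -- elapsed time, assumed to be in votes
    granularity -- subtask granularity, Gran
    max_len     -- steady-state maximum queue length, M
    """
    alpha = 16.0 / granularity  # experimentally fitted rate constant
    return max_len * (1.0 - math.exp(-alpha * t))


if __name__ == "__main__":
    granularity = 1024
    # Example M for a subtask 30% slower, using Equation 10 (fraction assumed).
    m = 0.0457 * (granularity * 0.30) - 4.0
    for t in (0, 20, 40, 80, 160, 320):
        print(f"t={t:4d} votes  L={queue_length(t, granularity, m):6.2f}")
```

The curve rises quickly at first and flattens toward M, with a time constant of Gran/16 votes, so larger granularities take proportionally longer to approach their bound.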


5.6.  Comparison  of  Model  and  Experiment 

Five experimentally determined curves are compared to five predicted curves in Figure 5-9. The predicted
results are very similar to the experimental results. The model seems to be good at predicting the queue
length. This model does not, however, take communication costs into account. When the granularity is small,
the Cm* interprocess communication costs become significant, causing each subtask iteration to have a greater
execution time. Therefore the model is not accurate for small granularities. Another problem with the model
is that it can sometimes predict a maximum queue length too large, and at other times can predict a maximum
queue length too small. The predictions are not consistently too high or too low. The model has several
derived parameters. The value of $\alpha$ was found experimentally, and the equation found for calculating $M$ is
based on a least-squares fit in which some points are outlying.

Figure 5-9: Predicted Queue Length and Actual Queue Length (Voter Queue Length vs Time; granularity = 1K, one subtask 10% to 50% slower, experimental and predicted curves)


The queuing model of the voter synchronization experiments can explain a large portion of the variation in
the experimental results. Some of the model parameters are difficult to determine, but they can be
approximated. The comparison of the predicted and actual results shows that the model has the proper form
to explain the experimental results. By changing the model parameters slightly to account for Cm*
perturbations, the model can explain most of the experimental results.


6.  Conclusion

This paper has explored some of the attributes of NMR computer systems. Many features of software
voters have been explored both experimentally and theoretically. Section 2 presented some software
voting concepts. N-modular redundancy was described, and the software concepts of time skew and
space redundancy were explained. Various synchronization issues were presented, including
time-outs, points of synchronization, and asynchronous versus synchronous systems. The frequency of voting
and the data granularity were shown to be important factors in determining the reliability of NMR systems.
Finally, a technique was described to allow easy bit-by-bit voting on words of data.

In Section 4, some experiments were presented to help measure the overhead involved in software voting.
The type of voter, the voting frequency, and the average instruction execution time were incorporated into a
model of voting overhead. The model was shown to accurately describe the experimental data, and an analysis
of the programs yielded upper and lower bounds on the possible overhead. The voting frequency was shown
to be the dominant factor in determining the voting overhead.

Section 5 presented a number of synchronization experiments. The amount of variation in process execution
speed that can be tolerated was determined for three different types of variation. The length of one voter's
queue was measured over time to determine how far a process can fall behind two other processes. The
queue length was shown to have a bound even when one process is continually slower than the other
processes. Guidelines for designing reliable NMR systems were presented, based on the experimental results.
A queuing model was developed to describe the length of a subtask's queue over time for any amount of
variation in process execution rate. The model was shown to accurately predict the experiments over a range
of values.

Many of the ideas mentioned in this article could be developed more fully. The reliability of asynchronous
versus synchronous systems could be explored, and the concept of time skew redundancy could be the basis
for reliability studies. The assertion was made that as the voting frequency increases there is a point at which
the reliability of the system will decrease. This seems intuitive, but it has not been demonstrated here; it could
probably be proven experimentally or mathematically.


I.  Voter  Algorithms 

1.1.  Simple Synchronizing Voter

Initialize
Loop forever
    For i = 1 to N
        Receive msg i
        Classify msg
        If majority found then
            send msg
    end For
    Report errors
end Loop
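
One iteration of this loop can be sketched in Python. This is a minimal illustration, not the voter used in the experiments: the callables `receive`, `send`, and `report` are hypothetical stand-ins for the message primitives, messages are assumed to be hashable values compared for exact equality, and the majority is forwarded only once per round.

```python
from collections import Counter


def run_round(receive, send, report, n):
    """One pass of the simple synchronizing voter over n replicated subtasks.

    receive(i) -- hypothetical blocking receive from subtask i
    send(msg)  -- hypothetical forward of the voted message
    report(ix) -- hypothetical error report; ix lists the offending subtasks
    """
    received = []        # messages in subtask order
    counts = Counter()   # classification: how many subtasks sent each value
    majority = None
    sent = False
    for i in range(n):
        msg = receive(i)             # waits for subtask i, which synchronizes the voter
        received.append(msg)
        counts[msg] += 1             # classify msg
        if not sent and 2 * counts[msg] > n:
            majority = msg
            send(msg)                # forward as soon as a majority agrees
            sent = True
    # Report every subtask that disagreed (all of them if no majority formed).
    errors = [i for i, m in enumerate(received) if not sent or m != majority]
    if errors:
        report(errors)
    return majority
```

Because each receive blocks, the round cannot finish until the slowest subtask has delivered its message, even though the majority result may already have been sent.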


1.2.  Optimized  Voter  with  internal  queue 

Initialize
Loop forever
    Conditional Receive next msg
    If voter buffer full then
        attempt to receive missing msgs
        If majority received then
            vote
        report errors
        initialize oldest msg slot
    store msg
    If majority arrived & not majority found then
        vote
    If all msgs arrived then
        report
        initialize msg slot
end Loop

1.3.  Sequence  Number  Voter 

Initialize
Loop forever
    Conditional Receive
    If illegal sequence number then
        report("illegal sequence number")
    Else
        Search for seq num in queue
        If seq num found then
            If subtask already sent this seq num then
                report("seq num duplicated")
            Else
                store msg
                If majority received then
                    vote
                If all msgs arrived then
                    If set not oldest then
                        report("complete set not oldest")
                    Else
                        report
                    initialize
        Else if seq num not found then
            If queue full then
                handle oldest msg
            store msg in new queue slot
end Loop
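
The message-handling step of the sequence number voter can be sketched in Python as follows. This is an illustrative sketch under several assumptions that are not in the original: the class name `SequenceVoter`, the `max_seq` bound used to decide whether a sequence number is legal, subtasks numbered 0 to n-1, "oldest" taken to mean the earliest-created queue slot, and the `outputs` and `reports` lists, which stand in for sending and error reporting. The outer loop and the conditional receive are left to the caller.

```python
from collections import Counter, OrderedDict


class SequenceVoter:
    """Hypothetical voter that groups messages by sequence number."""

    def __init__(self, n, queue_size, max_seq):
        self.n = n                      # number of replicated subtasks
        self.queue_size = queue_size    # maximum number of open slots
        self.max_seq = max_seq          # assumed bound for a legal sequence number
        self.queue = OrderedDict()      # seq -> slot, oldest slot first
        self.outputs = []               # (seq, value) pairs that won a vote
        self.reports = []               # error reports

    def _vote(self, seq, slot):
        """Send the majority value once, as soon as a majority has arrived."""
        if slot["voted"]:
            return
        value, count = Counter(slot["msgs"].values()).most_common(1)[0]
        if 2 * count > self.n:
            slot["voted"], slot["result"] = True, value
            self.outputs.append((seq, value))

    def _report(self, seq, slot):
        """Report subtasks whose message is missing or disagrees with the vote."""
        if not slot["voted"]:
            self.reports.append(("no majority", seq))
            return
        bad = [s for s in range(self.n)
               if s not in slot["msgs"] or slot["msgs"][s] != slot["result"]]
        if bad:
            self.reports.append(("disagreement", seq, bad))

    def receive(self, subtask, seq, value):
        if not 0 <= seq <= self.max_seq:
            self.reports.append(("illegal sequence number", subtask, seq))
            return
        slot = self.queue.get(seq)
        if slot is None:                                   # seq num not found
            if len(self.queue) >= self.queue_size:         # queue full
                old_seq, old = self.queue.popitem(last=False)
                self._report(old_seq, old)                 # handle oldest msg
            slot = self.queue[seq] = {"msgs": {}, "voted": False, "result": None}
        elif subtask in slot["msgs"]:
            self.reports.append(("seq num duplicated", subtask, seq))
            return
        slot["msgs"][subtask] = value                      # store msg
        self._vote(seq, slot)
        if len(slot["msgs"]) == self.n:                    # all msgs arrived
            if seq != next(iter(self.queue)):
                self.reports.append(("complete set not oldest", seq))
            else:
                self._report(seq, slot)
            del self.queue[seq]                            # initialize the slot
```

Keying the queue by sequence number lets a fast subtask run ahead without its messages being compared against an older round from a slow subtask, at the cost of a bounded number of open slots.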